Abstract: Systematic testing of every single component and interface is undoubtedly an important
measure to handle the complex nature of current software systems. However, this comes with often 
neglected computational costs. The aim of this paper is therefore to cut time and resource needs by 
predictive testing, i.e., predicting test failures with machine learning using a surprisingly simple 
statistical feature representation. Furthermore, we present the first open research benchmark for 
predictive testing to enable and foster future research in this area. 
1 Introduction


Software testing is crucial for the success of a professional software project. However,
as code complexity grows, regression testing becomes increasingly time and resource
consuming, resulting in hours of software tests to evaluate the code changes of every
new commit. Such a delay is impractical to handle during development and wastes
expensive resources. A classical solution to the problem is to track test dependencies
and the code coverage of each test over time. However, this is time-consuming and
sometimes impossible to achieve, especially within heterogeneous repositories with
several language barriers [MSPC19].


Because predictive testing is heavily project dependent, our contribution to the field of 
predictive testing is not a new machine learning approach, but rather a curated and open 
dataset to allow future comparison and evaluation of predictive testing methods. 


In summary, our main contributions are as follows:


• Repository containing all the databases and code:
  https://gitlab.com/nexxtnit/predictive_testing

• Ready-to-use database of test results and metadata on a commit basis for four
  open-source projects (Chap. 2).

• A new baseline method for predictive testing based on simple statistics (Chap. 3).

• Evaluation of a baseline approach (statistical feature extraction) for predictive
  testing to set the bar for future work (Chap. 5.2).

• In-depth analysis of feature relevance on the new dataset to provide insights
  concerning major relevant statistics (Chap. 5.3).
Figure 1 Illustration of predictive testing using supervised machine learning: past code changes and
corresponding test results are used for training a machine learning model able to predict test failure 
probabilities for future code changes. 


2 OpenPredict - A New Dataset for Predictive Testing 


The problem with current papers on and implementations of predictive testing is that the
data is often very project specific and not granular enough to draw general conclusions.
Because predictive testing is often tuned to specific projects, publishing only the machine
learning model has no benefit for further research. Furthermore, releasing the predictive
testing training data of closed-source implementations could reveal insights about the
software product. Data from continuous integration pipelines is often taken as is, so there
is no possibility of tuning the scope of the tests or choosing the granularity of test execution.


In contrast, the goal of this work is to create an open research dataset for experiments in
the field of predictive testing that can be systematically compared in future work as well.
Our dataset illustrates the minimal data that needs to be available to implement predictive
testing for a project of one's own. The dataset itself was generated from open-source
projects and could, in principle, be extended further with the data extraction pipeline we
built.


Our data comprises four curated open-source repositories of different nature. To ease
evaluation and allow for stable statistical evaluation, we used the following criteria to
select repositories:


• Written in Python to allow for applying language-specific features in the future
• Minimum of 400 commits to get enough data points over time
• Minimum of 10 tests to allow for multi-task prediction
• Testing happens with pytest or tox to ease the automatic data extraction


These requirements yielded the four repositories outlined in Table 1. In the following, we
explain the process of data extraction and dataset curation in detail.


Project   Description                                          Commits   Testing           Link

Flit      "Simplified packaging of Python modules" [Pypa21]    1017      Tox with Pytest   GitHub
Mock      "The Python mock library" [Test21]                   1277      Pytest            GitHub
Flake8    "flake8 is a python tool that glues together         2074      Tox with Pytest   GitHub
          pycodestyle, pyflakes, mccabe, and third-party
          plugins to check the style and quality of some
          python code." [Pyeq21]
Nox       "Flexible test automation for Python" [Thea22]       439       Pytest            GitHub


Table 1 Repositories used in our dataset for predictive testing. 


Data Extraction for Test Change Representation: Our method for data extraction is
illustrated in Figure 2. The approach is identical for all four projects tested in this
work. All information is obtained automatically from the project repository. The
test suite is extracted from the existing tests and test variations of the repository. Unlike
[Lund19], we do not generate synthetic code defects to obtain more test results. The commit
hash provides us with a unique identifier for the test result as well as the related code
changes.


In contrast to [PaPr21], our predictive testing paradigm is to forecast the test results of
every commit. Every commit in the repository is tested by every test in the test suite. As a
result, we gain precise data on code changes and test results.
Figure 2 Data extraction method: every commit is tested with all test variations. The data gathered
from the test run and the repository is saved in an SQL database keeping the correspondence 
between them. 
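
The "every commit against every test variation" scheme can be sketched as a plain cross product. This is only an illustrative sketch, not the extraction pipeline from the associated repository: the function name `plan_test_runs`, the commit hashes, and the test file names are hypothetical, and the commands are merely assembled, not executed.

```python
from itertools import product


def plan_test_runs(commit_hashes, test_files):
    """Pair every commit with every test variation (full cross product)."""
    runs = []
    for commit, test in product(commit_hashes, test_files):
        runs.append({
            "commit": commit,
            # The commands are only built here; running them is left to the caller.
            "checkout": ["git", "checkout", "--quiet", commit],
            "run": ["python", "-m", "pytest", test],
        })
    return runs


# Hypothetical hashes and test files, for illustration only.
runs = plan_test_runs(["a1b2c3", "d4e5f6"], ["test_build.py", "test_version.py"])
for run in runs:
    print(run["commit"], " ".join(run["run"]))
```

The number of planned runs grows with the product of commits and test variations, which is why the resulting tables contain far more rows than the repositories have commits.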


Database: All meta information and main statistical features are stored in an SQL 
database, which is outlined in Figure 3 and identical for all projects. The scripts used for 
automatically transforming a git repository to the database are available in our associated 
code repository. 
Figure 3 Database architecture: the commit-id, based on the commit hash, is used for
synchronization throughout the database.
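
To make the role of the commit-id as a join key concrete, the following minimal sketch builds a small in-memory SQLite database. The table and column names are assumptions chosen for illustration and are not taken from the published schema in Figure 3; the inserted values are made up as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE commits (
    commit_id   INTEGER PRIMARY KEY,   -- running number, 1 = oldest commit
    commit_hash TEXT UNIQUE NOT NULL
);
CREATE TABLE file_changes (
    commit_id     INTEGER NOT NULL REFERENCES commits(commit_id),
    file_name     TEXT NOT NULL,
    added_lines   INTEGER,
    deleted_lines INTEGER
);
CREATE TABLE test_results (
    commit_id INTEGER NOT NULL REFERENCES commits(commit_id),
    test_name TEXT NOT NULL,
    passed    INTEGER NOT NULL         -- assumed encoding: 1 = pass, 0 = fail
);
""")

# Made-up example rows for a single commit.
conn.execute("INSERT INTO commits VALUES (1028, 'abc123')")
conn.execute("INSERT INTO file_changes VALUES (1028, 'pyproject.toml', 5, 5)")
conn.execute("INSERT INTO test_results VALUES (1028, 'test_build_422.py', 0)")

# The commit-id links test results to the file changes of the same commit.
rows = conn.execute("""
    SELECT t.test_name, f.file_name, t.passed
    FROM test_results AS t JOIN file_changes AS f USING (commit_id)
""").fetchall()
print(rows)
```

Keeping test results and file changes in separate tables keyed by the commit-id means the wide, one-hot encoded training table can be derived by a join rather than stored directly.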


Test Suite: The goal of our predictive testing approach is to predict the test result
$y^{(j)} \in \{\text{success}, \text{fail}\}$ of all tests $T^{(j)}$ in the test suite. However,
tests can change over time, and these variations require special care. We need to add each
test variation $k$ of a test $T^{(j)}$ as an additional target $y_k^{(j)}$ to prevent the model
from learning wrong statistical dependencies between older code changes and test results
that are not related to the modified tests. We therefore include all changes of a test and
create a separate test target for each change (see Table 2). The tasks $y_k^{(j)}$ for a
fixed $j$ are not statistically independent, since test changes consist of tiny changes over
time. This fact can be exploited by a multi-task approach for learning.


By including every change of a test case as its own test variation, we increased the
number of test cases from 12 tests to 329, phrasing each of these variations as an
independent prediction task.
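
How a test file turns into several prediction targets can be sketched as follows. The sketch assumes that the numeric suffix of a variation (as in Table 2) is the commit-id at which that version of the test was created; this reading and the helper name `variation_targets` are assumptions, not something stated explicitly in the text.

```python
from pathlib import PurePosixPath


def variation_targets(change_history):
    """Map each test file to one target name per commit that modified it."""
    targets = {}
    for test_file, commit_ids in change_history.items():
        path = PurePosixPath(test_file)
        targets[test_file] = [
            f"{path.stem}_{cid}{path.suffix}" for cid in sorted(set(commit_ids))
        ]
    return targets


# Commit-ids at which each test file was changed (values taken from Table 2).
history = {
    "test_version.py": [367, 348],
    "test_parametrize.py": [354, 359, 367],
}
targets = variation_targets(history)
print(targets["test_parametrize.py"])
```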


Tests in Newest Commit    Test Variations in Test Suite

test_parametrize.py       test_parametrize_354.py
                          test_parametrize_359.py
                          test_parametrize_367.py
test_version.py           test_version_348.py
                          test_version_367.py
test_command.py           test_command_4.py
                          test_command_4.py
                          test_command_9.py
                          test_command_24.py


Table 2 Left: tests from the newest commit of Flit; Right: test variations 


Table 3 lists the dataset statistics for the Flit project.


Version (Commits)     All (Passed/Failed)         Training Set (P/F)          Testing Set (P/F)

Flit.V0.0 (634)       208'586 (33'391/175'195)    183'253 (28'472/154'781)    25'333 (4919/20'414)
Flit.V1.0.WV (538)    177'002 (28'764/148'238)    157'920 (28'764/143'611)    19'082 (3812/15'270)
Flit.V1.0 (538)       10'760 (7591/3169)          9600 (6830/2770)            1160 (761/399)

Table 3 Dataset statistics of the Flit dataset variations. The newest 10% of the commits are for the 
testing set, the other 90% go in the training set, see Chapter 5.1 for details. 


3 Predictive Testing with Statistical Features 


In the following, we outline our baseline machine learning approach for predictive testing, 
which will be evaluated in Sect. 5. 


A dataset consists of three parts, the one-hot encoded test results, the one-hot encoded file 
changes, and metadata per commit. As label data the pass/fail column is used. A snippet 
of the complete dataset can be examined in Table 4. 


CommitID   test_build_422.py   test_build_465.py   ...   pyproject.toml   FileComplexity   AddedLines   DeletedLines   PassFail

1028       1                   0                   ...   1                27               5            5              0
1028       0                   1                   ...   1                27               5            5              0


Table 4 Complete dataset from the Flit database for machine learning. All file changes and tests are 
one-hot encoded. The PassFail column is the label we try to predict, and the CommitID will be 
removed for training and testing. 
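
The row layout of Table 4 can be sketched in a few lines. The helper `build_rows`, the input dictionary layout, and the file name `README.rst` are hypothetical; the encoding 1 = pass, 0 = fail for PassFail is likewise an assumption, since the text does not state it.

```python
def build_rows(commit, tests, files):
    """Create one feature row per test variation executed on a commit."""
    rows = []
    for test_name, passed in commit["results"].items():
        # One-hot encoding of the task (which test variation the row refers to).
        row = {t: int(t == test_name) for t in tests}
        # One indicator per file: 1 if the file was changed in this commit.
        row.update({f: int(f in commit["changed"]) for f in files})
        # Metadata per commit, identical for all rows of the commit.
        row.update(commit["meta"])
        row["PassFail"] = int(passed)  # assumed encoding: 1 = pass, 0 = fail
        rows.append(row)
    return rows


commit_1028 = {
    "changed": {"pyproject.toml"},
    "meta": {"FileComplexity": 27, "AddedLines": 5, "DeletedLines": 5},
    "results": {"test_build_422.py": False, "test_build_465.py": False},
}
rows = build_rows(
    commit_1028,
    tests=["test_build_422.py", "test_build_465.py"],
    files=["pyproject.toml", "README.rst"],
)
for row in rows:
    print(row)
```

Because the test identity is itself a feature, all test variations share a single binary target column, which is what allows one classifier to serve every task.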


As a classification model, we use a random forest classifier to predict the outcome of a
test case as pass or fail. The random forest classifier consists of several decision trees,
each trained with a random fraction of the training data. A decision tree is made up of
inner nodes (split nodes) and leaves. At each split node, a single feature is evaluated since
we make use of simple decision stumps. After traversing a tree, the final estimate is given
by an empirical probability stored at each leaf during learning. A random forest decision is
determined by the average of all trees' estimates, and a majority vote is used for the final
classification decision.
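
Averaging the trees' probability estimates and taking a majority vote over the trees' individual decisions are two different aggregation rules that can occasionally disagree. The following toy sketch, with made-up per-tree failure probabilities and hypothetical function names, illustrates both; it does not claim to reproduce the exact rule used in the experiments.

```python
def average_decision(tree_probs, threshold=0.5):
    """Average the per-tree failure probabilities, then threshold the mean."""
    mean_prob = sum(tree_probs) / len(tree_probs)
    return mean_prob, ("fail" if mean_prob >= threshold else "success")


def majority_vote(tree_probs, threshold=0.5):
    """Let every tree vote on its own, then take the majority."""
    fail_votes = sum(p >= threshold for p in tree_probs)
    return "fail" if 2 * fail_votes > len(tree_probs) else "success"


# Made-up leaf probabilities of three trees for one (commit, test) pair.
tree_probs = [0.9, 0.4, 0.4]
mean_prob, averaged_label = average_decision(tree_probs)
voted_label = majority_vote(tree_probs)
print(round(mean_prob, 3), averaged_label, voted_label)
```

The averaged probability is the quantity needed for an ROC analysis, since it provides a continuous score rather than only a hard label.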


Due to the simple decision stumps used, a random forest classifier can easily cope with 
features of different magnitudes. Furthermore, we also exploit multi-task learning of all 
different target tasks (every test variation is a single target) since the tasks are highly 
related. This is done in our case by a one-hot-encoding of the task itself and a single binary 
variable as a target. Some branches in a single decision tree can therefore focus on an 
arbitrary set of tasks by using their one-hot encodings as features in early split nodes. This 
is beneficial compared to learning each predictive testing task independently from each 
other with only a few training examples available (each test variation might only relate to 
a few commits). 


For our experiments, we use a random forest with 1000 decision trees and the Gini split
criterion, without restricting the maximum number of examples in a leaf.


4 Related Work 


Reducing the number of executed tests in regression test suites is an established method
for reducing costs and time [YoHa12]. Test selection is a very efficient method for
reducing the number of executed tests [LHSL16]. Combining test selection with machine
learning is usually referred to as predictive testing. Despite its potential, its relevance for
sustainable development, and its partial adoption at large tech companies [RiMS21],
predictive testing is a rather unexplored area of research. In [MSPC19], Machalica et al.
also used historical data to implement a data-driven test selection strategy. This allowed
the authors to reduce the total number of test executions by two thirds while still finding
95% of all failed tests. Unfortunately, no open dataset was provided, rendering
reproduction of the results impossible. [Lund19] implemented a predictive test selection
tool by creating synthetic test results. The code was altered with small synthetic
modifications to obtain labeled data with a decent number of test failures. Test results were
classified using a random forest classifier, which allowed for reducing test executions by
50%, again at a 95% true positive detection rate for all failures. A severe disadvantage of
this approach is that its performance highly depends on how realistic the synthetic code
modifications are. In complex software systems, these are difficult to design and might in
addition be subject to change over time. In contrast, the approach of [SKPS20] uses simple
text similarity metrics to rank tests related to a code change. In comparison to the
impressive dataset gathered in [YBKB22] from 25 open-source projects, we concentrate on
the basic needs for implementing predictive testing. Their work exclusively uses Java
projects along with their continuous integration history. As a result, many additional data
features such as code coverage are available. However, using only continuous integration
data has the disadvantage that the granularity of code changes cannot be chosen freely:
the code changes differ from build to build and can range from one to several commits. In
our work, we concentrate on changes and test results per single commit, providing
predictive testing at a fine-grained level. Nevertheless, the three “high level features”
having the most impact on prediction in [YBKB22] are similar to the ones used in our work.
The work of Sharif et al. [ShML21] focuses on deep learning techniques for regression test
prediction. They show the benefits of their approach especially for large datasets.


A problem related to ours is test suite failure prediction [PaPr21], which tries to predict the
failure of the whole test suite rather than of each test case individually. A further overview
of predictive methods for software engineering is given in [YXLB22].


5 Experiments 


In the following, we evaluate our baseline approach on OpenPredict showing the power 
of simple statistical features for predictive testing. 


5.1 Experimental Setup


To experiment with machine learning methods for test case prediction, the dataset
first needs to be separated into training and testing data. Instead of selecting random cases
from the dataset, we decided to split the data by commit-id, i.e., we try to predict the
behaviour of newer commits from the history of older ones. This reflects the use case of
predictive testing, where the model is trained continuously and applied to the most recent
commit. While analyzing the databases, it became obvious that some commits did not
have any test results. The reason is that early commits in a repository are often not
able to execute tests at all and therefore contain no test results.


The dataset is split into testing and training data by the CommitID. Of all commits
containing test data, the newest 10% are used for testing and the other 90% for
training. Please note that the commit-id itself is not used as a feature.
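
A chronological split of this kind can be sketched as follows, assuming the rows are dictionaries that still carry a `CommitID` key (as in Table 4). The function names and the toy data are illustrative and not taken from the released code.

```python
def split_by_commit(rows, test_fraction=0.1):
    """Use the newest fraction of commits for testing, the rest for training."""
    commit_ids = sorted({row["CommitID"] for row in rows})
    n_test = max(1, round(len(commit_ids) * test_fraction))
    test_ids = set(commit_ids[-n_test:])
    train = [r for r in rows if r["CommitID"] not in test_ids]
    test = [r for r in rows if r["CommitID"] in test_ids]
    return train, test


def drop_commit_id(row):
    """Remove the CommitID so that it cannot be used as a feature."""
    return {k: v for k, v in row.items() if k != "CommitID"}


# Toy data: 20 commits with two test results each.
rows = [{"CommitID": cid, "PassFail": cid % 2} for cid in range(1, 21) for _ in range(2)]
train, test = split_by_commit(rows)
print(len(train), len(test))
```

Splitting on commits rather than on rows keeps all results of one commit on the same side of the split, so no information about a tested commit leaks into training.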


Since our prediction tasks are binary, we used ROC analysis as the main performance
metric for our experiments. Instead of calculating ROC results for each task separately,
we evaluated the random forest classifier on all tasks jointly, which provides us with an
aggregated performance metric over all test cases.
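
One way to obtain such an aggregated metric is to pool the predictions of all tasks and compute a single area under the ROC curve (AUC). The sketch below uses the rank interpretation of the AUC (the probability that a positive example scores higher than a negative one, with ties counted as one half). The labels and scores are made up, label 1 is assumed to denote the positive class, and the quadratic loop is meant for illustration rather than for large datasets.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive is ranked above a negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs examples of both classes")
    # Count pairwise wins of positives over negatives; ties count as 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up (labels, scores) per task; all tasks are pooled before scoring.
per_task = {
    "test_a": ([1, 0, 1], [0.9, 0.3, 0.6]),
    "test_b": ([0, 1], [0.6, 0.8]),
}
labels = [y for ys, _ in per_task.values() for y in ys]
scores = [s for _, ss in per_task.values() for s in ss]
auc = roc_auc(labels, scores)
print(round(auc, 4))
```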


Dataset         Specifications                                  Features

Flit.V0.0       • no metadata                                   • test
                • all commits, starting at init                 • changed files in commit
                • all test cases, no filtering of variations    • test result

Flit.V1.0.WV    • metadata FileComplexity and AddedLines        • test
                • starting at commit tag V1.0                   • changed files in commit
                • all test cases, no filtering of variations    • test result
                                                                • FileComplexity
                                                                • AddedLines

Flit.V1.0       • metadata FileComplexity and AddedLines added  • test
                • starting at commit tag V1.0                   • changed files in commit
                • combining all test variations to one          • test result
                  test case                                     • FileComplexity
                                                                • AddedLines


Table 5 Dataset variations of the Flit database: Flit.V0.0 has all raw data and no metadata, 
Flit.V1.0.WV has metadata added, Flit.V1.0 contains structured test results and metadata. 


5.2 Analysis on the Flit dataset


First, we evaluate several aspects of the Flit dataset with our baseline approach. In
particular, we want to evaluate the impact of skipping the very first commits and of dealing
with test case variations as individual tasks (see Figure 5 for an illustration). Our
experiments are performed with the three different versions of the Flit dataset listed in
Table 5. The results are given in Figure 4.


The AUC improves by about 6% when adding more relevant data to the dataset and by about
8% when structuring the test results of the test cases according to their commit-id.


Figure 4 Receiver operating characteristic (ROC) curves of the classification for the three
dataset variations of Flit from Tab. 5


To improve classification, we restructured the data of all test case variations. Because all
variations of a test case test the same code functionality, it makes sense to use the result
of only one test variation at a time: at every commit, the result of the matching test
variation must be used.


Consequently, a test variation is used as long as the commit-id of the commit under test
is greater than the commit-id of the test variation (Figure 5).


[Figure 5 content: TestSuite table with the columns CommitID, TestName, Passed, Failed, Skipped]


Figure 5 Test variations being restructured for a dataset. The result of a test variation is used
if the commit currently under test is newer than the commit in which the test variation was created.
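
The selection of the matching test variation can be sketched with a binary search. The sketch assumes that each variation is identified by the commit-id at which it was created and that, for a commit under test, the newest variation created strictly before that commit is the matching one; both the function name and this exact rule are an interpretation of Figure 5, not a confirmed detail of the pipeline.

```python
from bisect import bisect_left


def active_variation(variation_commit_ids, current_commit_id):
    """Return the newest variation created before the commit under test."""
    ids = sorted(variation_commit_ids)
    idx = bisect_left(ids, current_commit_id)  # number of ids < current commit
    return ids[idx - 1] if idx > 0 else None


# Creation commit-ids of the test_command variations from Table 2.
variations = [4, 9, 24]
for commit_id in (4, 9, 10, 100):
    print(commit_id, active_variation(variations, commit_id))
```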


5.3 In-depth analysis of feature relevance


The metadata features introduced in Sect. 3 contribute strongly to the classification, as
shown in Table 6. The feature importance is measured as the impurity decrease within each
tree; the absolute value of the accumulated impurity decrease is the importance of a
feature. Besides the test cases themselves, which are obvious features, the metadata
features are the ones with the biggest impact.


Nr.   Importance   Feature

1     0.137        test_importable_.py
2     0.119        test_metadata_.py
3     0.098        FileComplexity
4     0.055        test_config_.py
5     0.052        AddedLines
6     0.049        DeletedLines


Table 6 Feature importance of Flit.V1.0 dataset. The Metadata features FileComplexity, 
AddedLines and DeletedLines are within the 6 most important features. 
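
The impurity decrease behind such importance values can be illustrated for a single split. The sketch assumes the common definition in which the Gini impurity of the parent node is compared with the size-weighted impurity of its two children; the labels are made up, and accumulating these values over all splits that use a feature (and normalizing them) is only described here, not implemented.

```python
def gini(labels):
    """Gini impurity of a list of binary labels (0/1)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def impurity_decrease(parent, left, right):
    """Impurity reduction achieved by splitting parent into left and right."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted


# Made-up labels at a split node and in its two children.
parent = [1, 1, 1, 0, 0, 0, 0, 0]
left, right = [1, 1, 1, 0], [0, 0, 0, 0]
gain = impurity_decrease(parent, left, right)
print(gain)
```

A feature that is used in many splits with a large decrease, such as a one-hot test indicator that separates a frequently failing test from the rest, therefore ends up near the top of the ranking.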


When analyzing the classification results, the question arose what would happen if the
commit-id were kept as a feature in the dataset. The results can be seen in Figure 6.
Figure 6 Importance of commits included in dataset


Flit.V1.0: The blue line in the ROC analysis (Figure 6) and the leftmost classification
matrix belong to the best performing dataset from Chap. 5.1 (Flit.V1.0).


Flit.V1.0.WCID: The orange line and the second classification matrix from the left (Figure
6) belong to the same dataset as Flit.V1.0, but with the commit-id added as a feature.


The AUC of the random forest classifier improves by around 5% when the commit-id is
added as a feature to the dataset.


The reason for the performance improvement can be found in a deeper analysis of the 
nodes of the feature commit-id in the balanced random forest classifier, visible in Figure 
7. 


Figure 7 Node for feature commit-id in tree of balanced random forest classifier 


The whole dataset contains around 1000 commits. Starting with commit-id 1 for the
oldest commit, the newest commit has the id 1000. The node in Figure 7 separates the
commits at the commit-id value <= 635.5. Because our test set contains only the
newest 10% of the commits, the commit-id in the test set will always be greater than
635.5. This observation leads to the idea that fewer commits in the dataset could actually
benefit the classifier.


Flit.V2.0: The purple line in the ROC analysis and the second classification matrix from
the right belong to the same dataset as the first one. The commit-id is not included as a
feature, but only commits starting with commit tag V2.0 were used.


Flit.V3.0: The cyan line in the ROC analysis and the rightmost classification matrix
belong to the same dataset as the first one. The commit-id is not included as a feature, but
only commits starting with commit tag V3.0 were used.


In conclusion, it becomes clear that a logical connection of the code, represented with 
version tags, is important to train a machine learning model for predictive testing. This 
leads to the theory that not the whole history of a repository is necessary for predictive 
testing, but rather the last one or two versions of the project code. 


Version (Commits)       All (Passed/Failed)    Training Set (P/F)    Testing Set (P/F)

Flit.V1.0 (538)         10'760 (7591/3169)     9600 (6830/2770)      1160 (761/399)
Flit.V1.0.WCID (538)    10'760 (7591/3169)     9600 (6830/2770)      1160 (761/399)
Flit.V2.0 (347)         6940 (4948/1992)       6240 (4444/1796)      700 (504/196)
Flit.V3.0 (193)         3860 (2678/1182)       3460 (2382/1078)      400 (296/104)


Table 7 Statistical features of the datasets used to analyze the commit-id feature 


5.4 Full evaluation on OpenPredict 


Table 8 and Figure 8 show the size of all datasets and the performance of the classifiers
for the different projects. All four projects tested in our work yielded an acceptable
classification. The visible differences in Figure 8 can be explained by the code complexity,
size, and number of tests of the different projects. Flake8, the project with the worst
classification, also leaves room to further improve classification performance by starting
from a later commit tag.


Training a predictive testing model for different repositories shows that our approach is 
not project specific and offers the possibility to implement predictive testing in a general 
manner, as long as the correct data is provided. 


Version (Commits)    All (Passed/Failed)        Training Set (P/F)      Testing Set (P/F)

Flit.V2.0 (347)      6940 (4948/1992)           6240 (4444/1796)        700 (504/196)
Mock.V0.8 (464)      4640 (2636/2004)           4170 (2254/1916)        470 (382/88)
Nox.V0.0 (436)       6976 (2964/4012)           6272 (2486/3786)        704 (478/226)
Flake8.V3.7 (613)    27'585 (10'178/17'407)     24'795 (8564/16'231)    2790 (1614/1176)


Table 8 Statistical features of the datasets of all different projects


Figure 8 Overview of the best possible classification with optimizations discovered in this paper.


6 Conclusions and Future Work


We presented the first open research dataset for predictive testing on a fine-grained basis.
With this minimalistic approach, we identified the core features necessary to predict test
results. A detailed analysis of feature importance even allowed us to conclude that not the
whole code history is necessary for gathering training data. Furthermore, we developed a
baseline approach based on statistical features from the repository and a multi-task random
forest as predictor. Applied to the dataset, the approach showed surprisingly high
prediction performance, even though no detailed code analysis was used to enrich the
feature representation of the related code changes.


There is a multitude of research ideas to boost the performance of predictive testing,
including using recent language models trained on source code [CTJY21] or even directly
on code changes. Although we use a simple multi-task technique (using one-hot-encoded
task descriptions) for prediction, this representation lacks the information that variations
of the same underlying unit test are related to different degrees depending on their commit
history. In addition, the baseline approach and related ones should be further evaluated in
a continuous learning setting, where the model is learned from the last K commits and
applied to the most recent one.